{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# $$CatBoost\\ Object\\ Importance\\ Tutorial$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### In this tutorial we show how to detect noisy objects in your training dataset using object importance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from catboost import CatBoost, Pool, datasets\n",
    "from sklearn.model_selection import train_test_split"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "#### First, let's load the dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(24576, 9) (8193, 9)\n"
     ]
    }
   ],
   "source": [
    "train_df, _ = datasets.amazon()\n",
    "X, y = np.array(train_df.drop(['ACTION'], axis=1)), np.array(train_df.ACTION)\n",
    "cat_features = np.arange(9) # indices of categorical features\n",
    "\n",
    "X_train, X_validation, y_train, y_validation = train_test_split(X, y, test_size=0.25, random_state=42)\n",
    "train_pool = Pool(X_train, y_train, cat_features=cat_features)\n",
    "validation_pool = Pool(X_validation, y_validation, cat_features=cat_features)\n",
    "\n",
    "print(train_pool.shape, validation_pool.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Let's train CatBoost on the clean data and take a look at the quality:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.2157984851490331\n"
     ]
    }
   ],
   "source": [
    "cb = CatBoost({'iterations': 100, 'verbose': False, 'random_seed': 42})\n",
    "cb.fit(train_pool);\n",
    "print(cb.eval_metrics(validation_pool, ['RMSE'])['RMSE'][-1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Let's inject noise by flipping 10% of the training labels:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "np.random.seed(42)\n",
    "# Pick 10% of the train objects at random, without repetition\n",
    "perturbed_idxs = np.random.choice(len(y_train), size=int(len(y_train) * 0.1), replace=False)\n",
    "y_train_noisy = y_train.copy()\n",
    "# Flip the binary labels (0 <-> 1) of the selected objects\n",
    "y_train_noisy[perturbed_idxs] = 1 - y_train_noisy[perturbed_idxs]\n",
    "\n",
    "train_pool_noisy = Pool(X_train, y_train_noisy, cat_features=cat_features)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Now let's train CatBoost on the noisy data and take a look at the quality:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.25915746122622113\n"
     ]
    }
   ],
   "source": [
    "cb.fit(train_pool_noisy);\n",
    "print(cb.eval_metrics(validation_pool, ['RMSE'])['RMSE'][-1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Now let's sample 500 random validation objects (computing object importance for the entire validation dataset can take a long time) and calculate the importance of the train objects for these validation objects:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "np.random.seed(42)\n",
    "test_idx = np.random.choice(np.arange(y_validation.shape[0]), size=500, replace=False)\n",
    "validation_pool_sampled = Pool(X_validation[test_idx], y_validation[test_idx], cat_features=cat_features)\n",
    "\n",
    "indices, scores = cb.get_object_importance(\n",
    "    validation_pool_sampled,\n",
    "    train_pool_noisy,\n",
    "    importance_values_sign='Positive' # Positive values means that the optimized metric\n",
    "                                      # value is increase because of given train objects.\n",
    "                                      # So here we get the indices of bad train objects.\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Finally, in a loop, let's remove noisy objects in batches, retrain the model, and see how the quality on the validation dataset improves:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "RMSE on validation dataset when 0 harmful objects from train are dropped: 0.25915746122622113\n",
      "RMSE on validation dataset when 250 harmful objects from train are dropped: 0.25601149050939825\n",
      "RMSE on validation dataset when 500 harmful objects from train are dropped: 0.25158044983631966\n",
      "RMSE on validation dataset when 750 harmful objects from train are dropped: 0.24570533776587475\n",
      "RMSE on validation dataset when 1000 harmful objects from train are dropped: 0.24171376432589384\n",
      "RMSE on validation dataset when 1250 harmful objects from train are dropped: 0.23716221792112202\n",
      "RMSE on validation dataset when 1500 harmful objects from train are dropped: 0.23352830055657348\n",
      "RMSE on validation dataset when 1750 harmful objects from train are dropped: 0.23035731488436903\n",
      "RMSE on validation dataset when 2000 harmful objects from train are dropped: 0.2275943109556251\n"
     ]
    }
   ],
   "source": [
    "def train_and_print_score(train_indices, remove_object_count):\n",
    "    # Retrain on the train objects that are still kept and report RMSE on the full validation pool\n",
    "    cb.fit(X_train[train_indices], y_train_noisy[train_indices], cat_features=cat_features)\n",
    "    metric_value = cb.eval_metrics(validation_pool, ['RMSE'])['RMSE'][-1]\n",
    "    s = 'RMSE on validation dataset when {} harmful objects from train are dropped: {}'\n",
    "    print(s.format(remove_object_count, metric_value))\n",
    "\n",
    "batch_size = 250\n",
    "train_indices = np.full(X_train.shape[0], True)  # boolean mask: True means the object is kept\n",
    "train_and_print_score(train_indices, 0)\n",
    "for batch_start_index in range(0, 2000, batch_size):\n",
    "    # Exclude the next batch of harmful train objects, then retrain\n",
    "    train_indices[indices[batch_start_index:batch_start_index + batch_size]] = False\n",
    "    train_and_print_score(train_indices, batch_start_index + batch_size)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Therefore, we have the following RMSE values on the validation dataset:\n",
    "\n",
    "|Train dataset|RMSE on the validation dataset|\n",
    "|-|-|\n",
    "|Clean train dataset | 0.215798485149|\n",
    "|Noisy train dataset | 0.259157461226|\n",
    "|Purified train dataset (2000 harmful objects dropped) | 0.227594310956|"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### $$So\\ now\\ you\\ can\\ try\\ to\\ clear\\ the\\ train\\ dataset\\ of\\ noisy\\ objects\\ and\\ get\\ better\\ quality!$$"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
